Temporal-difference (TD) learning represents a paradigm shift in reinforcement learning. It bridges the gap between Monte Carlo methods, which learn from raw sampled experience, and Dynamic Programming, which updates one estimate from another. At its core, TD methods update estimates based in part on other learned estimates, without waiting for a final outcome (they bootstrap).
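Concretely, the simplest TD method, TD(0), shifts its estimate of the current state's value toward the reward just received plus the discounted value estimate of the next state:

$$V(S_t) \leftarrow V(S_t) + \alpha \bigl[ R_{t+1} + \gamma V(S_{t+1}) - V(S_t) \bigr]$$

The bracketed term is the TD error: the target $R_{t+1} + \gamma V(S_{t+1})$ leans on the learned estimate $V(S_{t+1})$ rather than the final return, which is exactly what "bootstrap" means here.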
The Driving Home Analogy
Imagine you are driving home. In a Monte Carlo world, you only update your belief about your commute time once you step through your front door. If you hit a massive traffic jam 10 minutes in, you just sit there, unable to 'learn' until the journey ends. In the TD Learning world, the moment you see those brake lights, you immediately adjust your estimate of your total travel time. You don't need the final outcome to know your initial prediction was wrong.
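To make the contrast concrete, here is a minimal Python sketch of the commute example. The waypoint names, travel times, and initial estimates are all made up for illustration; the point is only that the Monte Carlo update has to wait for the full trip, while the TD(0) update fires at every waypoint.

```python
alpha, gamma = 0.1, 1.0  # step size; no discounting for a finite commute

# Current estimates of remaining travel time (minutes) from each waypoint.
V = {"leave_office": 30.0, "highway": 20.0, "exit_ramp": 10.0, "home": 0.0}

# One observed trip: (waypoint, minutes spent before reaching the next one).
trip = [("leave_office", 5), ("highway", 25), ("exit_ramp", 12)]
waypoints = [s for s, _ in trip] + ["home"]

# --- Monte Carlo: no updates until the trip is over ---
V_mc = dict(V)
elapsed_before, elapsed = {}, 0
for state, minutes in trip:
    elapsed_before[state] = elapsed
    elapsed += minutes
total_trip = elapsed
for state, before in elapsed_before.items():
    observed_remaining = total_trip - before          # the actual return
    V_mc[state] += alpha * (observed_remaining - V_mc[state])

# --- TD(0): update the moment the next waypoint is reached ---
V_td = dict(V)
for (state, minutes), next_state in zip(trip, waypoints[1:]):
    target = minutes + gamma * V_td[next_state]       # bootstrap from the next estimate
    V_td[state] += alpha * (target - V_td[state])

print("Monte Carlo:", V_mc)
print("TD(0):     ", V_td)
```

Both end up with similar estimates after enough trips, but only the TD loop could have run mid-journey, the moment the brake lights appeared.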
The Bucket Brigade Intuition
Think of one-step, tabular, model-free TD methods as a bucket brigade. Instead of one person running back and forth between the fire and the well, a line of people passes buckets of information back along the chain. As soon as State B is reached, its estimated value is used to correct State A's value. This incremental nature often speeds up convergence in practice and enables learning in continuing tasks that have no natural end. A minimal sketch of the idea follows.
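As a concrete instance of that bucket brigade, here is a tabular TD(0) prediction sketch in Python. The environment, a small random-walk chain with a reward of 1 for exiting on the right, is a made-up illustrative example; the update inside the loop is the standard one-step TD(0) rule.

```python
import random

alpha, gamma = 0.1, 1.0
n_states = 5                       # states 0..4; stepping off either end terminates
V = [0.0] * n_states               # tabular value estimates

def step(state):
    """Move left or right at random; return (next_state, reward, done)."""
    nxt = state + random.choice([-1, 1])
    if nxt < 0:
        return None, 0.0, True      # fell off the left end
    if nxt >= n_states:
        return None, 1.0, True      # exited on the right: reward 1
    return nxt, 0.0, False

for episode in range(5000):
    state = n_states // 2           # start in the middle
    done = False
    while not done:
        next_state, reward, done = step(state)
        # Bootstrapped target: reward plus the current estimate of the next state.
        target = reward + (0.0 if done else gamma * V[next_state])
        V[state] += alpha * (target - V[state])   # the bucket-brigade correction
        state = next_state

print([round(v, 2) for v in V])     # drifts toward [1/6, 2/6, 3/6, 4/6, 5/6]
```

Each backup passes one "bucket" of value one step back along the chain, and because the update never waits for an episode to end, the same loop works unchanged on continuing tasks.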